La convergence des modularités structurelles et fonctionnelles des systèmes complexes. (The convergence of structural and functional modularities in complex systems)
Author
Abstract
Background: With improvements in genotyping technologies and the exponentially growing number of available markers, case-control genome-wide association studies promise to be a key tool for investigating complex diseases. However, new analytical methods have to be developed to address the problems induced by this data scale-up, such as statistical multiple testing, data quality control and computational tractability.
Results: We present a novel method to analyze the results of genome-wide association studies. The algorithm is based on a Bayesian model that integrates genotyping errors and genomic structure dependencies. p-values are assigned to genomic regions termed bins, which are defined from a gene-biased partitioning of the genome, and the false-discovery rate is estimated. We have applied this algorithm to data from three genome-wide association studies of Multiple Sclerosis.
Conclusion: The method overcomes the scale-up problems in practice and makes it possible to identify new putative regions statistically associated with the disease.

Background
Recent years have seen a tremendous increase in the number of markers available for association studies. Previous studies dealt either with the whole genome at a very low resolution (for instance 5 264 microsatellites in [1]) or with a carefully chosen region of a few million base pairs [2,3]. Recent technologies allow the genome-wide genotyping of hundreds of thousands of SNPs [4]. This has created a need for new methodological developments to overcome different issues, such as the multiple-testing problem, gene biases, data quality analysis and computational tractability.
Firstly, the multiple-testing problem appears to reduce the ability of association studies to detect associations as the number of markers increases.
The classical analysis strategy, based on an association test for each marker [5], encounters increasing difficulties as more than one million markers are available: increasing the number of markers prevents the detection of the mild genetic effects expected in complex diseases, as only strong effects emerge from the huge noise generated by the increased quantity of data. Methods such as False Discovery Rate (FDR) [6] computation make it possible to control the error rigorously, but do not increase the statistical power. Better strategies based on haplotype blocks are being developed, the first step being to gather such block data (see the HapMap project, [7]). The gain from such strategies is twofold: (i) the number of tests is independent of the number of markers; (ii) the statistical power may be increased if markers of the same haplotype block are not fully correlated.

From Machine Learning in Systems Biology: MLSB 2007, Evry, France, 24–25 September 2007. Published: 17 December 2008. BMC Proceedings 2008, 2(Suppl 4):S6. Selected Proceedings of Machine Learning in Systems Biology: MLSB 2007, Florence d'Alché-Buc and Louis Wehenkel. This article is available from: http://www.biomedcentral.com/1753-6561/2/S4/S6. © 2008 Omont et al; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Secondly, the genetic association of a given SNP is a statistical feature and does not by itself explain a phenotype. To interpret an associated marker biologically, its haplotype block should first be delimited.
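As an illustration of the FDR control mentioned above, here is a minimal sketch of the Benjamini-Hochberg step-up adjustment in Python. It assumes that this standard procedure is the kind of FDR computation meant; the text does not spell out the estimator actually used by the method, and the p-values below are invented for the example.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that the adjusted values
    # stay monotone: q_(rank) = min over ranks r >= rank of p_(r) * m / r.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-bin p-values, for illustration only (not from the study).
bin_p_values = [0.01, 0.04, 0.03, 0.20]
q_values = benjamini_hochberg(bin_p_values)

# Bins kept when the FDR is controlled at 10% (threshold chosen arbitrarily).
selected = [i for i, q in enumerate(q_values) if q <= 0.10]
```

Because the adjustment depends on the number of tests `m`, testing one p-value per bin rather than one per marker keeps `m` fixed as marker density grows.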
The association can then be refined by fine-scale genotyping technologies or, ideally, by full resequencing, which eventually makes it possible to identify functional mutations. Most of the time, these mutations affect relatively close genes. This is a first argument for biasing association analysis towards genes. Moreover, even if haplotype blocks are unavailable, DNA can be cut into distinct regions (called bins) on another basis, so as to limit the multiple-testing problem and make it independent of the number of markers. Combining these two arguments leads us to choose one bin for each gene, and to create "desert" bins in large unannotated regions. This associates a list of genes with each test, which simplifies the analysis of results. The drawbacks are (i) that it makes the study of these "deserts" more difficult, although the goal here is to maximize not the chance of finding an association but the chance of elucidating a mechanism of a complex disease given current knowledge, and (ii) that a bin might contain several haplotype blocks, diluting the association signal if only one block is associated. Conversely, neighboring bins are not independent because they may share a haplotype block. However, with the classical strategy, correlated neighboring SNPs would also be tested separately.

Thirdly, genome-wide genotyping data are obtained by high-throughput experiments whose limitations require careful statistical methodology. In particular, with Affymetrix (Affy.) technology, the trade-off between the call rate (which reflects errors detected by the genotyping process, resulting in missing genotypes in the data set) and the error rate (i.e. errors left in the data) is difficult to adjust. Obtaining unbiased statistical results then depends on good pre-processing filters: spurious markers must be eliminated and missing data correctly managed.
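The gene-biased partitioning described above (one bin per gene, plus "desert" bins for large unannotated regions) can be sketched as follows. This is only an illustration under stated assumptions: the function name `assign_bins`, the `flank` distance used to attach nearby intergenic SNPs to a gene, the bin labels and the coordinates are all hypothetical, and the text does not specify how intergenic markers are actually assigned.

```python
from bisect import bisect_right


def assign_bins(snp_positions, genes, flank=50_000):
    """Map each SNP position on one chromosome to a bin label.

    genes: list of (name, start, end), sorted by start and non-overlapping.
    flank: hypothetical maximum distance (bp) at which an intergenic SNP
           is still attached to the nearest gene; beyond it, the SNP falls
           into a "desert" bin between the two flanking genes.
    """
    starts = [start for _, start, _ in genes]
    labels = []
    for pos in snp_positions:
        i = bisect_right(starts, pos) - 1  # last gene starting at or before pos
        left = genes[i] if i >= 0 else None
        right = genes[i + 1] if i + 1 < len(genes) else None
        if left is not None and pos <= left[2]:
            labels.append(left[0])  # SNP lies inside the gene
            continue
        # Intergenic SNP: collect distances to the flanking genes.
        candidates = []
        if left is not None:
            candidates.append((pos - left[2], left[0]))
        if right is not None:
            candidates.append((right[1] - pos, right[0]))
        if candidates and min(candidates)[0] <= flank:
            labels.append(min(candidates)[1])  # attach to the nearest gene
        else:
            left_name = left[0] if left is not None else "start"
            right_name = right[0] if right is not None else "end"
            labels.append(f"desert:{left_name}-{right_name}")
    return labels


# Invented genes and SNP coordinates, for illustration only.
genes = [("GENE_A", 1_000, 2_000), ("GENE_B", 500_000, 510_000)]
snps = [1_500, 2_100, 250_000, 499_000, 600_000]
bins = assign_bins(snps, genes)
```

The number of bins, and hence the number of tests, depends only on the gene annotation and not on how many SNPs are genotyped. The binary search here costs a logarithmic factor per SNP; a single merged sweep over sorted SNPs and genes would avoid it.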
In addition, for most of the SNPs used in this study, some genotypes are carried by fewer than a few percent of patients, which, given the usual collection size of a few hundred, (i) is not enough for good asymptotic approximations and (ii) should be considered with care given the possibly high error rate.
Finally, because the number of available markers will probably soon reach a few million, creating a scalability problem, any algorithmic solution has to be linear in the number of markers.

In this paper we present a novel Bayesian algorithm developed to make genome-wide association studies easy to analyze. The algorithm is based on a gene-based partitioning of DNA into regions, called bins. A p-value of association is computed for each bin. The model takes genotyping errors and missing data into account and tries to detect simple differences in the haplotype block structure between cases and controls. The study of different collections is supported. The multiple-testing problem is addressed by estimating the FDR. The method has been applied to the results of three genome-wide case-control association studies of the complex disease Multiple Sclerosis (MS). It identifies putatively associated bins, containing genes previously described as linked to MS (see [8] for review) as well as new candidate genes.

Materials
Three association studies of Multiple Sclerosis (MS) in three independent collections have been carried out. Around 600 patients were recruited for each study, half of them as cases affected by the disease and half as controls (Table 1). Genotypes of the 116 204 SNPs were determined for each patient using Affymetrix GeneChip® human mapping 100 K technology (Affy. technology).

Methods
Notations
Stochastic variables are denoted by a round letter (𝒱), a realization by a lower-case letter (v). Indices are denoted in lower case (k), ranging from 1 to the corresponding upper-case letter (K).
Unless needed, this range of indices (k ∈ [1, K]) is omitted. The number of different values is denoted #(𝒱). The n-dimensional table of the number of individuals having the same combination of values for given var
Similar resources
Conception objet dans le cadre des systèmes d'information spatiaux: Agrégation spatiale et généralisation (Object-oriented design for spatial information systems: spatial aggregation and generalization)
Our aim is to show what the object paradigm, and in particular the UML formalism, contributes to building and comparing conceptual models. Dealing with problems of rural land management, we focus on notions of landscape representation. In particular, we explain the value of the concept of aggregation used for structural and dynamic purposes. N...
Serre's Reduction of Linear Functional Systems
Serre’s reduction aims at reducing the number of unknowns and equations of a linear functional system (e.g., system of partial differential equations, system of differential time-delay equations, system of difference equations). Finding an equivalent representation of a linear functional system containing fewer equations and fewer unknowns generally simplifies the study of its structural proper...
Vers l'Intégration des Propriétés non Fonctionnelles dans le Langage SADL (Towards the integration of non-functional properties in the SADL language)
Abstract. The notion of software architecture emerged around the 1990s and is now presented as the core of a discipline in its own right. Many architecture description languages (ADLs) have been proposed in the literature. They offer complementary capabilities for the development and architectural analysis of a software system. As the main objective...
Discrete approximation of the Mumford-Shah functional in dimension two
The Mumford-Shah functional, introduced to study image segmentation problems, is approximated in the sense of Γ-convergence by a sequence of integral functionals defined on piecewise affine functions. ...
Étude de performance des systèmes de découverte de ressources (Performance study of resource discovery systems)
Abstract. Desktop grids are a technology that exploits geographically dispersed resources to run complex applications requiring large computing power and substantial storage capacity. However, as the number of resources grows, the needs for scalability, self-organization, dynamic reconfiguration, déce...
Étude sur les portails et agrégateurs des ressources pédagogiques universitaires francophones en accès libre (Study of portals and aggregators of open-access French-language university teaching resources)
In addition to these three main strategic objectives, two further objectives (or requirements) of a technological and cultural nature must also be planned for in building a common French-language portal of free teaching resources: 1. First, a technical (and technological) requirement of convergence and consistency with international practice in the design and dissemination of resour...